An optimized algorithm for Data Oriented Parsing
Abstract
This paper presents an optimization of a syntactic disambiguation algorithm for Data Oriented Parsing (DOP) (Bod 93) in particular, and for Stochastic Tree-Substitution Grammars (STSGs) in general. The main advantage of this algorithm over existing alternatives ((Bod 93), (Schabes & Waters 93), (Sima'an et al. 94)) is that its time complexity is linear, rather than quadratic, in grammar size (and cubic in sentence length). It is particularly suitable for natural language STSGs, which have many deep elementary-trees and a small underlying Context-Free Grammar (CFG). A first implementation of this algorithm is operational and exhibits a substantial speed-up in comparison to the unoptimized version. In addition to presenting the optimized algorithm, the paper reports experiments measuring the disambiguation accuracy, the expected sizes and the execution times of various DOP models projected from the ATIS domain.
1 Motivation
Many models of natural language performance tend to train presupposed grammars in order to extend them probabilistically (e.g. (Schabes & Waters 93), (Black et al. 93)). In contrast, Data Oriented Parsing (DOP), suggested by Scha (Scha 90) and developed by Bod (Bod 92), projects an STSG directly from a given tree-bank. DOP projects an STSG by decomposing each tree in the tree-bank in all ways, at zero or more internal nodes each time, obtaining a set of constituent structures, which then serves as the elementary-trees set of an STSG.
An STSG is basically a Context-Free Grammar (CFG) with "rules" (or "productions") which have internal structure, i.e. are (elementary-)trees. Deriving a parse for a given sentence in an STSG means combining elementary-trees using the same substitution operation as used by CFGs. In contrast to CFGs, however, STSGs allow various derivations to generate the same parse.
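The decomposition of a tree-bank tree into elementary-trees can be illustrated with a small sketch. It is not the paper's implementation: the nested-tuple tree encoding, the function names and the toy sentence are all assumptions made for illustration. A node is written as `(label, child, ...)`, a terminal as a plain string, and an open substitution site as a one-element tuple `(label,)`.

```python
from itertools import product


def is_internal(node):
    """A node is internal (non-terminal) if it is a tuple; terminals are strings."""
    return isinstance(node, tuple)


def fragments_rooted_at(node):
    """Return every fragment whose root is `node`.

    Each non-terminal child is either cut off (left as an open
    substitution site) or expanded further with one of its own fragments.
    """
    label, *children = node
    options = []
    for child in children:
        if is_internal(child):
            # Cut here, or keep any fragment rooted at the child.
            options.append([(child[0],)] + fragments_rooted_at(child))
        else:
            # Terminal words always stay attached to their parent.
            options.append([child])
    return [(label, *combo) for combo in product(*options)]


def all_fragments(tree):
    """Collect the fragments rooted at every internal node of `tree`."""
    frags = fragments_rooted_at(tree)
    for child in tree[1:]:
        if is_internal(child):
            frags.extend(all_fragments(child))
    return frags


# Hypothetical one-tree "tree-bank" used only for illustration.
toy_tree = ("S", ("NP", "John"), ("VP", ("V", "sleeps")))
toy_fragments = all_fragments(toy_tree)
print(len(toy_fragments))  # 10
```

Even this three-word tree yields ten fragments, which suggests why a projected STSG can contain many deep elementary-trees while its underlying CFG stays small.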
Crucially for natural language disambiguation, the set of trees generated by combining the elementary-trees of an STSG cannot always be generated by a CFG; thus, STSGs impose extra constraints on the generated structures. To select a distinguished structure from the space of generated structures for a given sentence, DOP assigns probabilities to the application of elementary-trees in derivations. The probability which DOP infers for each elementary-tree is the ratio between the number of its appearances in the tree-bank (i.e. either as a tree or as a sub-tree) and the total number of appearances of all elementary-trees which share the same root non-terminal (see figure 1). A derivation's probability is then defined as the product of the probabilities of …
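The relative-frequency estimate described above can be sketched in a few lines. The fragment counts below are invented for illustration, and the function names are hypothetical. The derivation probability is computed under the assumption, standard in DOP, that the product runs over the elementary-trees used in the derivation.

```python
from collections import Counter
from fractions import Fraction
from math import prod


def elementary_tree_probabilities(fragment_counts):
    """Map each fragment to count(fragment) / total count of same-root fragments."""
    root_totals = Counter()
    for frag, count in fragment_counts.items():
        root_totals[frag[0]] += count  # frag[0] is the root non-terminal
    return {
        frag: Fraction(count, root_totals[frag[0]])
        for frag, count in fragment_counts.items()
    }


def derivation_probability(derivation, probs):
    """Assumed definition: product of the probabilities of the fragments used."""
    return prod(probs[frag] for frag in derivation)


# Hypothetical counts; real counts would come from decomposing a tree-bank.
toy_counts = {
    ("S", ("NP",), ("VP",)): 2,
    ("S", ("NP", "John"), ("VP",)): 1,
    ("S", ("NP",), ("VP", ("V", "sleeps"))): 1,
    ("NP", "John"): 1,
    ("NP", "Mary"): 1,
    ("VP", ("V", "sleeps")): 2,
}
toy_probs = elementary_tree_probabilities(toy_counts)

# One derivation of "John sleeps": substitute NP and VP into the bare S rule.
toy_derivation = [
    ("S", ("NP",), ("VP",)),
    ("NP", "John"),
    ("VP", ("V", "sleeps")),
]
toy_derivation_prob = derivation_probability(toy_derivation, toy_probs)
print(toy_derivation_prob)  # 1/4
```

Using `Fraction` keeps the toy arithmetic exact; a practical implementation would more likely work with floating-point or log probabilities.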